Method and system for providing a polymorphism database

ABSTRACT

Systems and methods for organizing information relating to the study of polymorphisms. A database model is provided which interrelates information about one or more of, e.g., subjects from whom samples are extracted, primers used in extracting the DNA from the subjects, the samples themselves, experiments done on samples, particular oligonucleotide probe arrays used to perform experiments, analysis procedures performed on the samples, and analysis results. The model is readily translatable into database languages such as SQL. The database model scales to permit storage of information about large numbers of subjects, samples, experiments, chips, etc.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims priority from U.S. Prov. App. No. 60/053,842 filed Jul. 25, 1997, entitled COMPREHENSIVE BIO-INFORMATICS DATABASE, from U.S. Prov. App. No. 60/069,198 filed on Dec. 11, 1997, entitled COMPREHENSIVE DATABASE FOR BIOINFORMATICS, and from U.S. Prov. App. No. 60/069,436, entitled GENE EXPRESSION AND EVALUATION SYSTEM, filed on Dec. 11, 1997. The contents of all three provisional applications are herein incorporated by reference.

[0002] The subject matter of the present application is related to the subject matter of the following three co-assigned applications filed on the same day as the present application: GENE EXPRESSION AND EVALUATION SYSTEM (Attorney Docket No. 018547-035010), METHOD AND APPARATUS FOR PROVIDING A BIOINFORMATICS DATABASE (Attorney Docket No. 018547-033810), and METHOD AND SYSTEM FOR PROVIDING A PROBE ARRAY CHIP DESIGN DATABASE (Attorney Docket No. 018547-033830). The contents of these three applications are herein incorporated by reference.

BACKGROUND OF THE INVENTION

[0003] The present invention relates to the collection and storage of information pertaining to chips for processing biological samples and thereby identifying polymorphisms.

[0004] The genomes of all organisms undergo spontaneous mutation in the course of their continuing evolution, generating variant forms of progenitor sequences (Gusella, Ann. Rev. Biochem. 55, 831-854 (1986)). The variant form may confer an evolutionary advantage or disadvantage relative to a progenitor form or may be neutral. In some instances, a variant form confers a lethal disadvantage and is not transmitted to subsequent generations of the organism. In other instances, a variant form confers an evolutionary advantage to the species and is eventually incorporated into the DNA of many or most members of the species and effectively becomes the progenitor form. In many instances, both progenitor and variant form(s) survive and co-exist in a species population. The coexistence of multiple forms of a sequence gives rise to polymorphisms.

[0005] Despite the increased amount of nucleotide sequence data being generated in recent years, only a minute proportion of the total repository of polymorphisms in humans and other organisms has so far been identified. The paucity of polymorphisms hitherto identified is due to the large amount of work required for their detection by conventional methods. For example, a conventional approach to identifying polymorphisms might be to sequence the same stretch of oligonucleotides in a population of individuals by dideoxy sequencing. In this type of approach, the amount of work increases in proportion to both the length of sequence and the number of individuals in a population and becomes impractical for large stretches of DNA or large numbers of persons.

[0006] Devices and computer systems for forming and using arrays of materials on a substrate have been developed. These devices and systems have been used for identifying polymorphisms. For example, PCT application WO92/10588, incorporated herein by reference for all purposes, describes techniques for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be formed according to, for example, the pioneering techniques disclosed in U.S. Pat. No. 5,143,854 and U.S. Pat. No. 5,571,639, both incorporated herein by reference for all purposes.

[0007] According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known locations on a chip or substrate. A fluorescently labeled nucleic acid is then brought into contact with the chip and a scanner generates an image file indicating the locations where the labeled nucleic acids bound to the chip. Based upon the identities of the probes at these locations, it becomes possible to extract information such as the identity of polymorphic forms of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may be used to study and detect mutations relevant to cystic fibrosis, the P53 gene (relevant to certain cancers), HIV, and other genetic characteristics.

[0008] It would be highly useful to apply such arrays to the study of polymorphisms on a large scale. For example, it would be useful to conduct large scale studies on the correlation between certain polymorphisms and individual characteristics such as susceptibility to diseases and effectiveness of drug treatments. To achieve these benefits, it is contemplated that the operations of chip design, construction, sample preparation, and analysis will occur on a very large scale. The quantity of information related to each of these steps that must be stored and correlated is vast. For large scale polymorphism studies, it will be necessary to store this information in a way that facilitates later querying and retrieval. What is needed is a system and method suitable for storing and organizing large quantities of information used in conjunction with polymorphism studies.

SUMMARY OF THE INVENTION

[0009] The present invention provides systems and methods for organizing information relating to the study of polymorphisms. A database model is provided which interrelates information about one or more of, e.g., subjects from whom samples are extracted, primers used in extracting the DNA from the subjects, the samples themselves, experiments done on samples, particular oligonucleotide probe arrays used to perform experiments, analysis procedures performed on the samples, and analysis results. The model is readily translatable into database languages such as SQL. The database model scales to permit storage of information about large numbers of subjects, samples, experiments, chips, etc.

[0010] Applications include linkage studies to determine resistance to drugs, susceptibility to diseases, and study of every characteristic of humans and other organisms that is related to genetic variability. Another application of a database constructed according to this model is quality control of the various steps of performing a polymorphism study. By preserving information about every step of a polymorphism study, one can assess the reliability of the results or use the preserved information as feedback to improve procedures.

[0011] A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 illustrates an overall system and process for forming and analyzing arrays of biological materials such as DNA or RNA.

[0013] FIG. 2A illustrates a computer system suitable for use in conjunction with the overall system of FIG. 1.

[0014] FIG. 2B illustrates a computer network suitable for use in conjunction with the overall system of FIG. 1.

[0015] FIG. 3 illustrates a key for interpreting a database model.

[0016] FIGS. 4A-4H illustrate a database model for maintaining information for the system and process of FIG. 1 according to one embodiment of the present invention.

DESCRIPTION OF SPECIFIC EMBODIMENTS

[0017] Investigation of Polymorphisms

[0018] A. Preparation of Samples

[0019] Polymorphisms are detected in a target nucleic acid from an individual being analyzed. For assay of genomic DNA, virtually any biological sample (other than pure red blood cells) is suitable. For example, convenient tissue samples include whole blood, semen, saliva, tears, urine, fecal material, sweat, buccal, skin and hair. For assay of cDNA or mRNA, the tissue sample must be obtained from an organ in which the target nucleic acid is expressed. For example, if the target nucleic acid is a cytochrome P450, the liver is a suitable source.

[0020] Many of the methods described below require amplification of DNA from target samples. This can be accomplished by, e.g., PCR. See generally PCR Technology: Principles and Applications for DNA Amplification (ed. H. A. Erlich, Freeman Press, NY, N.Y., 1992); PCR Protocols: A Guide to Methods and Applications (eds. Innis, et al., Academic Press, San Diego, Calif., 1990); Mattila et al., Nucleic Acids Res. 19, 4967 (1991); Eckert et al., PCR Methods and Applications 1, 17 (1991); PCR (eds. McPherson et al., IRL Press, Oxford); and U.S. Pat. No. 4,683,202 (each of which is incorporated by reference for all purposes).

[0021] Other suitable amplification methods include the ligase chain reaction (LCR) (see Wu and Wallace, Genomics 4, 560 (1989); Landegren et al., Science 241, 1077 (1988)), transcription amplification (Kwoh et al., Proc. Natl. Acad. Sci. USA 86, 1173 (1989)), self-sustained sequence replication (Guatelli et al., Proc. Nat. Acad. Sci. USA, 87, 1874 (1990)), and nucleic acid based sequence amplification (NASBA). The latter two amplification methods involve isothermal reactions based on isothermal transcription, which produce both single stranded RNA (ssRNA) and double stranded DNA (dsDNA) as the amplification products in a ratio of about 30 or 100 to 1, respectively.

[0022] B. Detection of Polymorphisms in Target DNA

[0023] There are two distinct types of analysis depending on whether a polymorphism in question has already been characterized. The first type of analysis is sometimes referred to as de novo characterization. This analysis compares target sequences in different individuals to identify points of variation, i.e., polymorphic sites. By analyzing groups of individuals representing the greatest ethnic diversity among humans and greatest breed and species variety in plants and animals, patterns characteristic of the most common alleles/haplotypes of the locus can be identified, and the frequencies of such alleles/haplotypes in the population determined. Additional allelic frequencies can be determined for subpopulations characterized by criteria such as geography, race, or gender. The second type of analysis is determining which form(s) of a characterized polymorphism are present in individuals under test. There are a variety of suitable procedures, which are discussed in turn.

[0024] 1. Allele-Specific Probes

[0025] The design and use of allele-specific probes for analyzing polymorphisms is described by, e.g., Saiki et al., Nature 324, 163-166 (1986); Dattagupta, EP 235,726; Saiki, WO 89/11548. Allele-specific probes can be designed that hybridize to a segment of target DNA from one individual but do not hybridize to the corresponding segment from another individual due to the presence of different polymorphic forms in the respective segments from the two individuals. Hybridization conditions should be sufficiently stringent that there is a significant difference in hybridization intensity between alleles, and preferably an essentially binary response, whereby a probe hybridizes to only one of the alleles. Some probes are designed to hybridize to a segment of target DNA such that the polymorphic site aligns with a central position (e.g., in a 15-mer, at the 7 position; in a 16-mer, at either the 8 or 9 position) of the probe. This design of probe achieves good discrimination in hybridization between different allelic forms.

[0026] Allele-specific probes are often used in pairs, one member of a pair showing a perfect match to a reference form of a target sequence and the other member showing a perfect match to a variant form. Several pairs of probes can then be immobilized on the same support for simultaneous analysis of multiple polymorphisms within the same target sequence.

[0027] 2. Tiling Arrays

[0028] The polymorphisms can also be identified by hybridization to nucleic acid arrays, some examples of which are described by WO 95/11995 (incorporated by reference in its entirety for all purposes). WO 95/11995 also describes subarrays that are optimized for detection of variant forms of a precharacterized polymorphism. Such a subarray contains probes designed to be complementary to a second reference sequence, which is an allelic variant of the first reference sequence. The second group of probes is designed by the same principles as described in the Examples except that the probes exhibit complementarity to the second reference sequence. The inclusion of a second group (or further groups) can be particularly useful for analyzing short subsequences of the primary reference sequence in which multiple mutations are expected to occur within a short distance commensurate with the length of the probes (i.e., two or more mutations within 9 to 21 bases).

[0029] 3. Allele-Specific Primers

[0030] An allele-specific primer hybridizes to a site on target DNA overlapping a polymorphism and only primes amplification of an allelic form to which the primer exhibits perfect complementarity. See Gibbs, Nucleic Acid Res. 17, 2427-2448 (1989). This primer is used in conjunction with a second primer which hybridizes at a distal site. Amplification proceeds from the two primers, leading to a detectable product signifying that the particular allelic form is present. A control is usually performed with a second pair of primers, one of which shows a single base mismatch at the polymorphic site and the other of which exhibits perfect complementarity to a distal site. The single-base mismatch prevents amplification and no detectable product is formed. The method works best when the mismatch is included in the 3′-most position of the oligonucleotide aligned with the polymorphism because this position is most destabilizing to elongation from the primer. See, e.g., WO 93/22456.

[0031] 4. Direct-Sequencing

[0032] The direct analysis of the sequence of polymorphisms of the present invention can be accomplished using either the dideoxy chain termination method or the Maxam Gilbert method (see Sambrook et al., Molecular Cloning, A Laboratory Manual (2nd Ed., CSHP, New York 1989); Zyskind et al., Recombinant DNA Laboratory Manual, (Acad. Press, 1988)).

[0033] 5. Denaturing Gradient Gel Electrophoresis

[0034] Amplification products generated using the polymerase chain reaction can be analyzed by the use of denaturing gradient gel electrophoresis. Different alleles can be identified based on the different sequence-dependent melting properties and electrophoretic migration of DNA in solution. Erlich, ed., PCR Technology, Principles and Applications for DNA Amplification, (W. H. Freeman and Co, New York, 1992), Chapter 7.

[0035] 6. Single-Strand Conformation Polymorphism Analysis

[0036] Alleles of target sequences can be differentiated using single-strand conformation polymorphism analysis, which identifies base differences by alteration in electrophoretic migration of single stranded PCR products, as described in Orita et al., Proc. Nat. Acad. Sci. 86, 2766-2770 (1989). Amplified PCR products can be generated as described above, and heated or otherwise denatured, to form single stranded amplification products. Single-stranded nucleic acids may refold or form secondary structures which are partially dependent on the base sequence. The different electrophoretic mobilities of single-stranded amplification products can be related to base-sequence differences between alleles of target sequences.

[0037] Biological Material Analysis System

[0038] One embodiment of the present invention operates in the context of a system for analyzing biological or other materials using arrays that themselves include probes that may be made of biological materials such as RNA or DNA. The VLSIPS™ and GeneChip™ technologies provide methods of making and using very large arrays of polymers, such as nucleic acids, on chips. See U.S. Pat. No. 5,143,854 and PCT Patent Publication Nos. WO 90/15070 and 92/10092, each of which is hereby incorporated by reference for all purposes. Nucleic acid probes on the chip are used to detect complementary nucleic acid sequences in a sample nucleic acid of interest (the “target” nucleic acid).

[0039] FIG. 1 illustrates an overall system 100 for forming and analyzing arrays of biological materials such as RNA or DNA. A part of system 100 is a polymorphism database 102. Polymorphism database 102 includes information about, e.g., biological sources, preparation of samples, design of arrays, raw data obtained from applying experiments to chips, analysis procedures applied, and analysis results. Polymorphism database 102 facilitates large scale study of polymorphisms.

[0040] A chip design system 104 is used to design arrays of polymers, for example biological polymers such as RNA or DNA. Chip design system 104 may be, for example, an appropriately programmed Sun Workstation or personal computer or workstation, such as an IBM PC equivalent, including appropriate memory and a CPU. Chip design system 104 obtains inputs from a user regarding chip design objectives including polymorphisms of interest, and other inputs regarding the desired features of the array. Optionally, chip design system 104 obtains information from external databases such as GenBank. The output of chip design system 104 is a set of chip design computer files in the form of, for example, a switch matrix, as described in PCT application WO 92/10092, and other associated computer files. The chip design computer files form a part of polymorphism database 102. Systems for designing chips for study of polymorphisms are disclosed in U.S. Pat. No. 5,571,639 and in PCT application WO 95/11995, the contents of which are herein incorporated by reference.

[0041] The chip design files are input to a mask design system (not shown) that designs the lithographic masks used in the fabrication of arrays of molecules such as DNA. The mask design system generates mask design files that are then used by a mask construction system (not shown) to construct masks or other synthesis patterns such as chrome-on-glass masks for use in the fabrication of polymer arrays.

[0042] The masks are used in a synthesis system (not shown). The synthesis system includes the necessary hardware and software used to fabricate arrays of polymers on a substrate or chip. The synthesis system includes a light source and a chemical flow cell on which the substrate or chip is placed. A mask is placed between the light source and the substrate/chip, and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip. Selected chemical reagents are directed through the flow cell for coupling to deprotected regions, as well as for washing and other operations. The substrates fabricated by the synthesis system are optionally diced into smaller chips. The output of the synthesis system is a chip ready for application of a target sample.

[0043] Information about the mask design, mask construction, and probe array synthesis is presented by way of background. A biological source 112 is, for example, tissue from a plant or animal. Various processing steps are applied to material from biological source 112 by a sample preparation system 114. Operation of sample preparation system 114 in the context of a polymorphism study is discussed below in further detail.

[0044] The prepared samples include nucleic acid sequences such as DNA. When the sample is applied to the chip by a sample exposure system 116, the nucleic acids may or may not bond to the probes. The nucleic acids can be tagged with fluorescein labels to determine which probes have bonded to nucleotide sequences from the sample. The prepared samples will be placed in a scanning system 118. Scanning system 118 includes a detection device such as a confocal microscope or CCD (charge-coupled device) that is used to detect the location where labeled receptors have bound to the substrate. The output of scanning system 118 is an image file(s) indicating, in the case of fluorescein labeled receptor, the fluorescence intensity (photon counts or other related measurements, such as voltage) as a function of position on the substrate. These image files may also form a part of polymorphism database 102. Since higher photon counts will be observed where the labeled nucleic acid(s) has bound more strongly to the array of probes, and since the monomer sequence of the probes on the substrate is known as a function of position, it becomes possible to analyze the sequence(s) of the nucleic acid(s) that are complementary to the probes.

[0045] The image files and the design of the chips are input to an analysis system 120 that, e.g., calls bases. Such analysis techniques are described in EPO Pub. No. 0717113A, the contents of which are herein incorporated by reference.
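By way of illustration only, the idea that intensity as a function of position can be turned into a base call may be sketched as follows. This is a deliberately naive stand-in and is not the analysis technique of EPO Pub. No. 0717113A: it assumes, hypothetically, that each interrogated position has four probes differing only in the base they interrogate, and it calls the base whose probe is brightest. The function names call_base and call_sequence, the min_ratio threshold, and the intensity values are all assumptions made for the example.

```python
def call_base(intensities, min_ratio=1.2):
    """Call one base from four probe intensities (hypothetical rule).

    intensities maps each candidate base to the intensity of the probe
    interrogating that base. Returns "N" when the brightest probe does not
    exceed the runner-up by at least min_ratio, or when nothing is lit.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"  # no signal at all
    if second > 0 and best / second < min_ratio:
        return "N"  # ambiguous: two probes are nearly equally bright
    return best_base


def call_sequence(per_position):
    """Concatenate the calls for a list of per-position intensity maps."""
    return "".join(call_base(p) for p in per_position)


# Made-up photon counts for two interrogated positions.
scan = [
    {"A": 900, "C": 100, "G": 120, "T": 80},
    {"A": 500, "C": 480, "G": 90, "T": 70},
]
calls = call_sequence(scan)
```

In this sketch the second position is reported as ambiguous because its two brightest probes differ by less than the assumed threshold.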

[0046] Chip design system 104, analysis system 120 and control portions of exposure system 116, sample preparation system 114, and scanning system 118 may be appropriately programmed computers such as a Sun workstation or IBM-compatible PC. An independent computer for each system may perform the computer-implemented functions of these systems or one computer may combine the computerized functions of two or more systems. One or more computers may maintain polymorphism database 102 independent of the computers operating the systems of FIG. 1, or polymorphism database 102 may be fully or partially maintained by these computers.

[0047] FIG. 2A depicts a block diagram of a host computer system 210 suitable for implementing the present invention. Host computer system 210 includes a bus 212 which interconnects major subsystems such as a central processor 214, a system memory 216 (typically RAM), an input/output (I/O) adapter 218, an external device such as a display screen 224 via a display adapter 226, a keyboard 232 and a mouse 234 via I/O adapter 218, a SCSI host adapter 236, and a floppy disk drive 238 operative to receive a floppy disk 240. SCSI host adapter 236 may act as a storage interface to a fixed disk drive 242 or a CD-ROM player 244 operative to receive a CD-ROM 246. Fixed disk 242 may be a part of host computer system 210 or may be separate and accessed through other interface systems. A network interface 248 may provide a direct connection to a remote server via a telephone link or to the Internet. Network interface 248 may also connect to a local area network (LAN) or other network interconnecting many computer systems. Many other devices or subsystems (not shown) may be connected in a similar manner.

[0048] Also, it is not necessary for all of the devices shown in FIG. 2A to be present to practice the present invention, as discussed below. The devices and subsystems may be interconnected in different ways from that shown in FIG. 2A. The operation of a computer system such as that shown in FIG. 2A is readily known in the art and is not discussed in detail in this application. Code to implement the present invention may be operably disposed or stored in computer-readable storage media such as system memory 216, fixed disk 242, CD-ROM 246, or floppy disk 240.

[0049] FIG. 2B depicts a network 260 interconnecting multiple computer systems 210. Network 260 may be a local area network (LAN), wide area network (WAN), etc. Polymorphism database 102 and the computer-related operations of the other elements of FIG. 1 may be divided amongst computer systems 210 in any way, with network 260 being used to communicate information among the various computers. Portable storage media such as floppy disks may be used to carry information between computers instead of network 260.

[0050] Overall Description of Database

[0051] Polymorphism database 102 is preferably a relational database with a complex internal structure. The structure and contents of polymorphism database 102 will be described with reference to a logical model depicted in FIGS. 4A-4H that describes the contents of tables of the database as well as interrelationships among the tables. A visual depiction of this model is an Entity Relationship Diagram (ERD), which includes entities, relationships, and attributes. A detailed discussion of ERDs is found in “ERwin version 3.0 Methods Guide” available from Logic Works, Inc. of Princeton, N.J., the contents of which are herein incorporated by reference. Those of skill in the art will appreciate that automated tools such as Developer 2000 available from Oracle will convert the ERD from FIGS. 4A-4H directly into executable code such as SQL code for creating and operating the database.

[0052] FIG. 3 is a key to the ERD that will be used to describe the contents of polymorphism database 102. A representative table 302 includes one or more key attributes 304 and one or more non-key attributes 306. Representative table 302 includes one or more records where each record includes fields corresponding to the listed attributes. The contents of the key fields taken together identify an individual record. In the ERD, each table is represented by a rectangle divided by a horizontal line. The fields or attributes above the line are key while the fields or attributes below the line are non-key. An identifying relationship 308 signifies that the key attribute of a parent table 310 is also a key attribute of a child table 312. A non-identifying relationship 314 signifies that the key attribute of a parent table 316 is also a non-key attribute of a child table 318. Where (FK) appears in parentheses, it indicates that an attribute of one table is a key attribute of another table. Both the depicted non-identifying and identifying relationships are one to one-or-more relationships where one record in the parent table corresponds to one or more records in the child table. An alternative non-identifying relationship 324 is a one to zero-or-more relationship where one record in a parent table 320 corresponds to zero or more records in a child table 322.
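By way of illustration only, the two relationship types of the key might translate into SQL as sketched below, here executed through Python's built-in sqlite3 module. The table and column names (parent, identified_child, referencing_child, seq_no, and so on) are hypothetical and do not appear in the figures. The sketch shows where the parent's key lands in each child table; it makes no attempt to enforce the one-or-more versus zero-or-more cardinality.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE parent (
    parent_id INTEGER PRIMARY KEY,
    name      TEXT
);
-- Identifying relationship: the parent's key is part of the child's key.
CREATE TABLE identified_child (
    parent_id INTEGER NOT NULL REFERENCES parent(parent_id),
    seq_no    INTEGER NOT NULL,
    note      TEXT,
    PRIMARY KEY (parent_id, seq_no)
);
-- Non-identifying relationship: the parent's key is a non-key (FK) attribute.
CREATE TABLE referencing_child (
    child_id  INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES parent(parent_id),
    note      TEXT
);
""")


def key_columns(table):
    """Return the primary-key column names of a table, in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    return [r[1] for r in sorted(rows, key=lambda r: r[5]) if r[5] > 0]
```

Calling key_columns on the two child tables shows the distinction: parent_id is among the key columns of identified_child but not of referencing_child.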

[0053] Database Model

[0054] FIGS. 4A-4H are entity relationship diagrams (ERDs) showing elements of polymorphism database 102 according to one embodiment of the present invention. Each rectangle in the diagram corresponds to a table in database 102.

[0055] The interrelationships and general contents of the tables of database 102 will be described first. Then a chart will be presented listing and describing all of the fields of the various tables.

[0056] FIG. 4A illustrates core elements of database 102 according to one embodiment of the present invention. A subject table 402 lists organisms from which samples have been extracted for polymorphism analysis or other tissue sources. Samples may also be obtained from tissue collections not associated with any one identified organism. Information stored within subject table 402 includes the name, gender, family, position within family (e.g., father, mother, etc.), and ethnic group. For human subjects, the name and family will preferably be represented in coded form to assure privacy. Associated with each subject is a species as listed in a species table 404. Also, a relationship may be defined among subjects by a subject relationship table 406, which includes records corresponding to related subjects. These relationships may be father-mother, sibling, twins, etc. Subjects may be part of a group that is being studied, e.g., a group with a congenital disease, or a toxic reaction to a particular drug. The groups are listed in a subject group table 408. Participation of subjects in groups is defined by a subject participation table 410 which lists all group memberships.
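By way of illustration only, the subject, species, subject group, and subject participation tables might be realized as sketched below using Python's built-in sqlite3 module. The column names, the coded subject names, and the sample rows are assumptions made for the example; the actual fields are those shown in FIGS. 4A-4H.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (
    species_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE subject (
    subject_id   INTEGER PRIMARY KEY,
    coded_name   TEXT NOT NULL,      -- coded, not the real name, for privacy
    gender       TEXT,
    ethnic_group TEXT,
    species_id   INTEGER NOT NULL REFERENCES species(species_id)
);
CREATE TABLE subject_group (
    group_id    INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
-- One row per membership: a subject may be in many groups and vice versa.
CREATE TABLE subject_participation (
    subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
    group_id   INTEGER NOT NULL REFERENCES subject_group(group_id),
    PRIMARY KEY (subject_id, group_id)
);
INSERT INTO species VALUES (1, 'Homo sapiens');
INSERT INTO subject VALUES
    (1, 'S-0001', 'F', NULL, 1),
    (2, 'S-0002', 'M', NULL, 1),
    (3, 'S-0003', 'F', NULL, 1);
INSERT INTO subject_group VALUES (10, 'toxic reaction to a particular drug');
INSERT INTO subject_participation VALUES (1, 10), (3, 10);
""")


def members_of(group_id):
    """Return the coded names of all subjects participating in a group."""
    rows = conn.execute(
        """SELECT s.coded_name
           FROM subject s
           JOIN subject_participation p ON p.subject_id = s.subject_id
           WHERE p.group_id = ?
           ORDER BY s.coded_name""",
        (group_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

A query such as members_of(10) then lists every subject in the studied group without touching the subject records themselves.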

[0057] Samples and their attributes are listed in a sample table 412. Each sample has an associated sample type. The sample types are listed in a sample type table 414. Possible sample types include blood, urine, etc. Companies or institutions that provide samples are listed in a sample source table 416.

[0058] Database 102 provides an item table 418 that includes records for items. There are various types of items that correspond to different stages of the sample preparation process. An “item derivation” transforms an item of one type into an item of another type. The following table lists various item types and item derivation types for a representative embodiment.

Item Type                                   Derived from       by Item Derivation Type
Sample                                      other samples      pooling
Sample                                      other sample       splitting
Extracted DNA                               Sample             DNA Extraction
Target (Sequences of interest amplified)    Extracted DNA      PCR
Fluorescently Labeled Target                Target             Labeling
Hybridized Chip                             Labeled Target     Hybridization (application of target to chip)
Stained Hybridized Chip                     Hybridized Chip    Staining

[0060] Item derivations are listed in an item derivation table 420. It should be noted that derivations need not produce a change between item types. Each item derivation occurs in accordance with a protocol that characterizes the step or steps in the derivation. Protocols are listed in a protocol table 428. Each item derivation is performed by an employee listed in employee table 432.
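By way of illustration only, recording each derivation as a link from a source item to a derived item allows the full preparation history of any item to be recovered with a recursive query. The sketch below uses Python's built-in sqlite3 module; the column names (source_item_id, derived_item_id, derivation_type) and the sample rows are assumptions, and the item type is stored as text here for brevity rather than through a separate item type table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_id   INTEGER PRIMARY KEY,
    item_type TEXT NOT NULL
);
CREATE TABLE item_derivation (
    derivation_id   INTEGER PRIMARY KEY,
    source_item_id  INTEGER NOT NULL REFERENCES item(item_id),
    derived_item_id INTEGER NOT NULL REFERENCES item(item_id),
    derivation_type TEXT NOT NULL
);
INSERT INTO item VALUES
    (1, 'Sample'), (2, 'Extracted DNA'), (3, 'Target'),
    (4, 'Fluorescently Labeled Target'), (5, 'Hybridized Chip'),
    (6, 'Stained Hybridized Chip');
INSERT INTO item_derivation VALUES
    (1, 1, 2, 'DNA Extraction'), (2, 2, 3, 'PCR'), (3, 3, 4, 'Labeling'),
    (4, 4, 5, 'Hybridization'), (5, 5, 6, 'Staining');
""")


def lineage(item_id):
    """Return the item types from the earliest ancestor down to item_id."""
    rows = conn.execute(
        """WITH RECURSIVE ancestry(item_id, depth) AS (
               SELECT ?, 0
               UNION ALL
               SELECT d.source_item_id, a.depth + 1
               FROM item_derivation d
               JOIN ancestry a ON d.derived_item_id = a.item_id
           )
           SELECT i.item_type
           FROM ancestry a JOIN item i ON i.item_id = a.item_id
           ORDER BY a.depth DESC""",
        (item_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

For the sample rows above, lineage(6) walks from the stained hybridized chip back to the original sample. With pooled samples an item has several ancestors at the same depth, and their relative order in the result is not defined by this query.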

[0061] Unused chips are listed in a chip table 422. Hybridized chips (i.e., chips that have had target applied) are listed in a hybridized chip table 424. A hybridized sample map table 426 lists the relationships between hybridized chips and the samples that have been applied to them.

[0062] Stained hybridized chips are scanned in a process referred to here as a scan experiment. Scan experiments are listed in a scan experiment table 430. The scan experiment occurs in accordance with a protocol listed in protocol table 428. The scan experiment is performed by an employee listed in employee table 432.

[0063] FIG. 4B depicts further details of the data model for items and item derivations. The various item types are listed in an item type table 434 and the various item derivation types are listed in an item derivation type table 436. The relationships between successive item types, e.g., sample and target, are defined in an item type derivation table 438. An item has associated attributes. For example, for a target, database 102 may store the concentration, volume, location and/or remaining amount. All item attributes are stored in an item attribute table 440. Item attributes may be shared among multiple items. For example, a series of targets may all share a preparation date. An item attribute item map table 442 implements a many-to-many relationship between item attributes and items. The various types of item attributes such as preparer, preparation date, etc. are listed in an item attribute type table 444. Each item type has corresponding attribute types. Some attribute types are, however, shared among various item types. Accordingly, there is a many-to-many relationship among item attribute types and item types that is implemented by an item type map table 446.

[0064] The tables of FIG. 4B represent a powerfully general model of the sample preparation process. Changes in process steps that require changes in the type of information that should be stored may be implemented by changing and adding table contents rather than providing new tables or changing relationships among tables.
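By way of illustration only, the following sketch shows how a new kind of item information can be accommodated by adding rows rather than altering tables. It uses Python's built-in sqlite3 module; the column names, the helper functions add_attribute and attributes_of, and the sample values (including the preparation date) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,
    label   TEXT NOT NULL
);
CREATE TABLE item_attribute_type (
    attribute_type_id INTEGER PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE
);
CREATE TABLE item_attribute (
    attribute_id      INTEGER PRIMARY KEY,
    attribute_type_id INTEGER NOT NULL
        REFERENCES item_attribute_type(attribute_type_id),
    value             TEXT NOT NULL
);
-- Many-to-many: one attribute row may be shared by several items.
CREATE TABLE item_attribute_item_map (
    item_id      INTEGER NOT NULL REFERENCES item(item_id),
    attribute_id INTEGER NOT NULL REFERENCES item_attribute(attribute_id),
    PRIMARY KEY (item_id, attribute_id)
);
INSERT INTO item VALUES (1, 'target A'), (2, 'target B');
INSERT INTO item_attribute_type VALUES (1, 'preparation date');
INSERT INTO item_attribute VALUES (100, 1, '1998-01-15');  -- made-up date
INSERT INTO item_attribute_item_map VALUES (1, 100), (2, 100);
""")


def add_attribute(item_id, type_name, value):
    """Attach a value to an item, creating the attribute type if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO item_attribute_type (name) VALUES (?)",
        (type_name,),
    )
    type_id = conn.execute(
        "SELECT attribute_type_id FROM item_attribute_type WHERE name = ?",
        (type_name,),
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO item_attribute (attribute_type_id, value) VALUES (?, ?)",
        (type_id, value),
    )
    conn.execute(
        "INSERT INTO item_attribute_item_map VALUES (?, ?)",
        (item_id, cur.lastrowid),
    )


def attributes_of(item_id):
    """Return {attribute type name: value} for one item."""
    rows = conn.execute(
        """SELECT t.name, a.value
           FROM item_attribute_item_map m
           JOIN item_attribute a ON a.attribute_id = m.attribute_id
           JOIN item_attribute_type t
             ON t.attribute_type_id = a.attribute_type_id
           WHERE m.item_id = ?""",
        (item_id,),
    ).fetchall()
    return dict(rows)


# A previously unknown attribute type is introduced with inserts only.
add_attribute(1, "concentration", "50 ng/uL")
```

Both targets share the single preparation date row, while the concentration introduced afterwards belongs to one target only; no CREATE or ALTER statement was needed for the new attribute type.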

[0065] FIG. 4C depicts a detailed data model for storing information about protocols according to the present invention. Protocols as stored in protocol table 428 represent information about particular processes that have been performed including item derivations, analyses, and scan experiments. Each protocol has an associated protocol template. Protocol templates identify protocol types. For example, one protocol template may be a PCR template. All protocols associated with the PCR template identify parameters for performing a PCR procedure. Protocol templates are listed in a protocol template table 448. A parameter table 450 lists all the parameters and their values for all the protocols listed in protocol table 428. A parameter template table 452 lists the various parameter types along with default values. An example of a parameter template would be a PCR reaction temperature. The parameter template would include a default value for this parameter. Parameter table 450 might then list many different PCR reaction temperature values that would be used by many different protocols. If a parameter value has not been modified by the user, it inherits the default value of the associated parameter template. A parameter template set is a set of parameter templates that are used for a particular purpose, e.g., in association with protocols according to one or more protocol templates. Parameter template sets are listed in a parameter template set table 454. There are different types of parameter template sets, and these are listed in a parameter template set type table 456. A mapping between parameter template sets and protocol templates is defined by a protocol template set map table 458.
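By way of illustration only, the inheritance of template defaults by unmodified parameters might be expressed as sketched below with Python's built-in sqlite3 module. Representing an unmodified parameter as a NULL value is an assumption of this sketch, as are the column names, the protocol names, and the numeric values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parameter_template (
    parameter_template_id INTEGER PRIMARY KEY,
    name                  TEXT NOT NULL,
    default_value         TEXT NOT NULL
);
CREATE TABLE protocol (
    protocol_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE parameter (
    protocol_id           INTEGER NOT NULL REFERENCES protocol(protocol_id),
    parameter_template_id INTEGER NOT NULL
        REFERENCES parameter_template(parameter_template_id),
    value                 TEXT,  -- NULL means "not modified by the user"
    PRIMARY KEY (protocol_id, parameter_template_id)
);
INSERT INTO parameter_template VALUES
    (1, 'PCR reaction temperature', '55'),
    (2, 'cycle count', '30');
INSERT INTO protocol VALUES (1, 'PCR run A'), (2, 'PCR run B');
INSERT INTO parameter VALUES
    (1, 1, NULL), (1, 2, NULL),   -- run A keeps every default
    (2, 1, '58'), (2, 2, NULL);   -- run B overrides the temperature only
""")


def effective_parameters(protocol_id):
    """Return {parameter name: value}, falling back to template defaults."""
    rows = conn.execute(
        """SELECT t.name, COALESCE(p.value, t.default_value)
           FROM parameter p
           JOIN parameter_template t
             ON t.parameter_template_id = p.parameter_template_id
           WHERE p.protocol_id = ?""",
        (protocol_id,),
    ).fetchall()
    return dict(rows)
```

Resolving the value at query time means that, in this sketch, later changes to a template default are seen by every protocol that never overrode it. A design that copies the default into the parameter row when the protocol is created would instead freeze the value as it was when the protocol was performed.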

[0066] Protocol templates may have associated lengthy verbal information about how to perform protocol steps. A protocol template document table 460 stores references to documents that include instructions for performing protocols.

[0067] As with the items, the data model for protocols defined by FIG. 4C is highly general and allows significant changes in the way item derivations, analyses, and experiments are performed without changing the underlying data model.

[0068] Referring again to FIG. 4A, there are tables to record information concerning the use of primers in PCR. A fragment table 462 lists all the sequence fragments investigated in conjunction with database 102. Associated with each fragment are one or more primer pairs used to amplify the fragment in a PCR process. A primer pair table 464 lists all the primer pairs including information about whether the primer pair actually worked to amplify the fragment. In order to develop the information about the effectiveness of primer pairs, there is a PCR table 466 that lists records identifying the outcome of multiple PCR operations. The individual PCR operations are identified by reference to item derivation table 420.

[0069] A single PCR operation may be used to amplify many different fragments and thus employ many different primer pairs. Of course, a single primer pair may be used in multiple PCR operations. There is therefore a many-to-many relationship between PCR operations and primer pairs that is recorded by a primer pair PCR map table 468.
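By way of illustration only, the many-to-many relationship between PCR operations and primer pairs might be sketched with Python's built-in sqlite3 module as follows. Table and column names follow the chart of this document where available; the reduced column sets, the key constraints, and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblPrimerPair (PrimerPairId INTEGER PRIMARY KEY);
CREATE TABLE tblPCR (
    Item1Id INTEGER,
    Item2Id INTEGER,
    PRIMARY KEY (Item1Id, Item2Id)
);
-- Map table: one row per (primer pair, PCR operation) combination.
CREATE TABLE PrimePairPCRMap (
    PrimerPairId INTEGER REFERENCES tblPrimerPair (PrimerPairId),
    Item1Id INTEGER,
    Item2Id INTEGER,
    PRIMARY KEY (PrimerPairId, Item1Id, Item2Id),
    FOREIGN KEY (Item1Id, Item2Id) REFERENCES tblPCR (Item1Id, Item2Id)
);
""")

# Hypothetical data: three primer pairs and two PCR operations.
conn.executemany("INSERT INTO tblPrimerPair VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO tblPCR VALUES (?, ?)", [(10, 11), (10, 12)])
conn.executemany(
    "INSERT INTO PrimePairPCRMap VALUES (?, ?, ?)",
    [(1, 10, 11), (2, 10, 11), (1, 10, 12), (3, 10, 12)],
)

# Primer pairs employed in one PCR operation.
pairs_in_pcr = [row[0] for row in conn.execute(
    "SELECT PrimerPairId FROM PrimePairPCRMap "
    "WHERE Item1Id = 10 AND Item2Id = 11 ORDER BY PrimerPairId")]

# PCR operations that employed one primer pair.
pcrs_for_pair = conn.execute(
    "SELECT Item1Id, Item2Id FROM PrimePairPCRMap "
    "WHERE PrimerPairId = 1 ORDER BY Item1Id, Item2Id").fetchall()
```

Neither the primer pair table nor the PCR table needs a repeating column in this arrangement; each additional pairing is one more row in the map table.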

[0070] Information about individual primers is stored in a primer table 470. Also, each primer has an associated protocol in protocol table 428 that characterizes the primer preparation process. Information about primer orders is listed in a primer order table 472. Each primer order is to a vendor and the vendors are listed in a vendor table 474. Each primer order is made by an employee listed in employee table 432. A primer order design map table 476 implements a many-to-many relationship between primer orders and primers.

[0071] The data model described here thus preserves information about primers used in PCR reactions. One can improve results by using primers that have successfully amplified a given fragment in the past. Sometimes particular groups of primer pairs cannot be multiplexed together in the same PCR process. The information preserved here thus permits experimenters to make optimal use of expensive and time-consuming PCR procedures.
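By way of illustration only, a lookup of primer pairs that previously amplified a given fragment might be sketched as follows. The column names follow the primer pair table of the chart; the function name proven_pairs, the use of 1 to mean that a pair worked, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblPrimerPair ("
    "PrimerPairId INTEGER PRIMARY KEY, FragmentId INTEGER, Worked INTEGER)"
)
# Hypothetical history: pairs 1 and 3 amplified fragment 100, pair 2 did not.
conn.executemany(
    "INSERT INTO tblPrimerPair VALUES (?, ?, ?)",
    [(1, 100, 1), (2, 100, 0), (3, 100, 1), (4, 200, 1)],
)


def proven_pairs(connection, fragment_id):
    """Return identifiers of primer pairs recorded as having worked for a fragment."""
    rows = connection.execute(
        "SELECT PrimerPairId FROM tblPrimerPair "
        "WHERE FragmentId = ? AND Worked = 1 ORDER BY PrimerPairId",
        (fragment_id,),
    )
    return [row[0] for row in rows]
```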

[0072] It is also useful to preserve information about the chip production process and the origin of individual chips. A wafer table 478 lists wafers. When chips are produced, many chips are produced at the same time as part of a single wafer. Chip table 422 stores references to wafer table 478 for each chip and the location of each chip on its wafer at production time. Sometimes there is analytic significance associated with the location of a chip on the wafer. Each wafer is produced as part of a lot and the identity of the lot for each wafer is recorded by wafer table 478 as a reference to a lot table 480 that lists each lot.

[0073] FIG. 4D depicts further details of tables pertaining to chip design that are preferably maintained within polymorphism database 102 according to one embodiment of the present invention. A tiling design table 482 lists tiling designs. Each tiling design represents the application of a particular tiling format to a sequence to be investigated. Tiling formats indicate probe orientation, probe length, and the position within a probe of a single nucleotide polymorphism being investigated. In a preferred embodiment, there may be very few tiling formats and they are listed in a tiling format table 484.

[0074] A particular tiling design includes many atom designs, each specifying the design of a single atom. In one embodiment, an atom is a group of typically four probes used to investigate a single base position with each probe hybridizing to a sequence including a different base at that position. Atom designs are listed in an atom design table 486. Records identifying the designs of individual probes are listed in a probe design table 488. A probe design role table 490 indicates the roles of probes listed in probe design table 488 in the atom designs of atom design table 486. For combinations of probe design and atom design, probe design role table 490 indicates which base the probe hybridizes to at the substitution position and whether the probe represents a match or a mismatch to the wild type.
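By way of illustration only, the grouping of four probes into an atom might be sketched as follows. For simplicity each probe is represented by the target sequence it is intended to hybridize to rather than by its complementary oligonucleotide, and orientation is ignored; the function name, dictionary keys, and example sequence are hypothetical.

```python
def atom_probes(sequence, position, probe_length, substitution_position):
    """Return the four probe targets interrogating one base position.

    substitution_position is the zero-based offset of the interrogated base
    within each probe, as a tiling format would specify.
    """
    start = position - substitution_position
    if start < 0 or start + probe_length > len(sequence):
        raise ValueError("probe window falls outside the sequence")
    window = sequence[start:start + probe_length]
    reference_base = sequence[position]
    probes = []
    for base in "ACGT":
        target = (window[:substitution_position] + base
                  + window[substitution_position + 1:])
        probes.append({
            "base": base,
            "target": target,
            # 0 marks the probe matching the reference, 1 marks a mismatch.
            "mismatches": 0 if base == reference_base else 1,
        })
    return probes


# Hypothetical reference fragment; position 6 holds the base "A".
probes = atom_probes("GATTACAGATTACA", position=6, probe_length=5,
                     substitution_position=2)
```

Each returned entry corresponds loosely to one record of a probe design role table: the substituted base and whether the probe matches the reference.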

[0075] A probe data table 492 gives the hybridization intensity values for particular probe designs as determined in particular scan experiments. Each record of the table also gives the number of pixels used to determine the intensity value and the standard deviation of intensity as measured among the pixels.
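By way of illustration only, one probe data record might be derived from raw pixel values as follows. The use of the arithmetic mean for the intensity and of the population standard deviation is an assumption, as the description does not say how these statistics are computed; the function name and pixel values are hypothetical.

```python
from statistics import fmean, pstdev


def summarize_probe(pixel_values):
    """Reduce the pixels of one probe cell to the three stored statistics."""
    if not pixel_values:
        raise ValueError("at least one pixel is required")
    return {
        "Intensity": fmean(pixel_values),   # assumed: mean pixel intensity
        "NPixels": len(pixel_values),
        "StDev": pstdev(pixel_values),      # assumed: population standard deviation
    }


record = summarize_probe([100, 110, 90, 100])
```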

[0076] FIGS. 4E-4G depict aspects of polymorphism database 102 related to analysis procedures and their results according to one embodiment of the present invention. An analysis table 494 lists analyses performed. An analysis generally refers to a non-trivial transformation of data. Records of analysis table 494 include references to protocol table 428 to specify parameters used for each analysis. Analyses may take as their input raw data or the results of previous analyses. An analysis dependency table 496 lists dependencies among analyses where one analysis depends on the data developed by another analysis. An analysis input table 498 lists inputs for analyses listed in analysis table 494.

[0077] On the right side of FIG. 4E are various tables used to support analyses. A chip design sequence map table 500 correlates particular fragments with chip designs. A sequence position table 502 lists investigated sequence positions indicating their positions on a fragment. Records of sequence position table 502 reference a genomic sequence position table 504 which gives sequence positions in the genome rather than within individual fragments.

[0078] A scan experiment set table 506 lists sets of scan experiments. This allows for groupings of experiments for individuals or populations to serve as the basis for polymorphism analysis. A scan experiment used table 508 lists records indicating memberships of a scan experiment in a scan experiment set.

[0079] A tiling data table 510 lists records identifying tiling designs as implemented in particular chips measured by particular scan experiments. An atom data table 512 lists the intensities measured for particular sequence positions as measured in scan experiments identified by the tiling data records. A subject sequence position data table 514 lists combinations of sequence position and scan experiment.

[0080] A series of tables in FIGS. 4E-4G correspond to different types of analysis that occur during the course of a polymorphism investigation. The types presented here are merely representative. A parallel series of tables provide the analysis results. A polymorphism analysis table 516 lists references to analysis table 494. The results of the performed polymorphism analyses are listed in a polymorphism position result table 518. A record of this table gives a result for a polymorphism analysis for a particular position as determined based on a particular set of scan experiments. In one embodiment the result is whether a particular mutation is certain, likely, possible, or not possible at the position. The result may also be that the reference is wrong.

[0081] A user polymorphism analysis table 520 lists user interpretations of results as listed in polymorphism position result table 518. The records of user polymorphism analysis table 520 are references to analysis table 494. The user interpretations themselves are stored in a user polymorphism analysis result table 522. Each result is a likelihood of a particular mutation at a position as considered by a user plus an accompanying user comment.

[0082] A P-Hat analysis estimates the relative concentrations of wild type sequence and sequence having a particular mutation as determined in a particular scan experiment. A P-Hat analysis table 524 lists references to analysis table 494. An atom result table 526 gives estimates of the relative concentration along with upper and lower bounds and a maximum intensity. For heterozygous mutations, the estimates of relative concentration will cluster around 0.5. For homozygous mutations, the estimates should cluster around 1.0.
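By way of illustration only, the interpretation of a relative concentration estimate might be sketched as follows. The description does not give the formula for the estimate; the simple intensity fraction used here is an assumption, as are the function names and the treatment of values near 0.0 as wild type. Only the reference points 0.5 and 1.0 come from the description above.

```python
def p_hat(wt_intensity, mut_intensity):
    """Assumed estimator: share of total signal attributed to the mutant probe."""
    total = wt_intensity + mut_intensity
    if total <= 0:
        raise ValueError("intensities must sum to a positive value")
    return mut_intensity / total


def nearest_genotype(estimate):
    """Label an estimate by the closest of the reference points 0.0, 0.5, 1.0."""
    centers = {
        "wild type": 0.0,            # assumption, not stated in the description
        "heterozygous": 0.5,
        "homozygous mutant": 1.0,
    }
    return min(centers, key=lambda label: abs(estimate - centers[label]))
```

A production analysis would also report the upper and lower bounds stored in the atom result table, which this sketch omits.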

[0083] Base call analyses are determinations of the base at a particular position for a particular individual that may be based on more than one experiment. A base call analysis table 528 lists references to analysis table 494. A base call result table 530 lists the called bases for particular combinations of sequence position and subject.

[0084] A P-Hat grouping analysis determines a measure of likelihood that data in a set of scan experiments results from separate genotypes. P-Hat grouping analyses are listed in a P-Hat grouping analysis table 532 by reference to analysis table 494. P-Hat grouping analysis results are listed in a mutation fraction result table 534. A group separation is given for various combinations of sequence position and scan experiment set.

[0085] A clustering analysis determines an alternative measure of likelihood that data in a set of scan experiments results from separate genotypes. Clustering analyses are listed in a clustering analysis table 536 by reference to analysis table 494. Clustering analysis results are listed in a clustering result table 538. A clustering factor is given for various combinations of sequence position and scan experiment set.

[0086] FIG. 4F shows tables which support normalization and footprint finding operations that support the analyses referred to in FIG. 4E. Hybridization intensity measurements made in scan experiments should be normalized over a set of scan experiments. The normalization should take into account differences in amplification level produced by different PCR processes.

[0087] Normalization is done by region of sequence. A normalization region analysis determines the boundaries of a region to be normalized. The determination of boundaries takes into account that different fragments of sequence are amplified by different PCR procedures. A normalization region analysis table 540 lists normalization region analyses by reference to analysis table 494. A normalization region result table 542 lists the boundaries for each determined normalization region.

[0088] Normalization values for identified normalization regions are themselves determined by normalization analyses. Normalization analyses are listed in a normalization analysis table 544 by reference to analysis table 494. A normalization result table 546 lists the normalization values for regions.

[0089] A footprint analysis determines regions of sequence for which the hybridization intensity is elevated for the purposes of quality control. Footprint analyses are listed in a footprint analysis table 548 by reference to analysis table 494. Footprints are identified by sequence starting point and ending point in a particular scan experiment in a footprint table 550.

[0090] FIG. 4G depicts tables pertaining to measurement quality according to one embodiment of the present invention. A tiling data quality analysis determines the quality of results from a scan experiment. These analyses are listed in a tiling data quality analysis table 552 by reference to analysis table 494. Tiling data quality analysis results are listed in a tiling data quality result table 554. The results include an average hybridization intensity value for perfect match or mismatch probes. A wild type call rate gives the fraction of atom data where the probe corresponding to the reference base has the highest hybridization intensity. A wild type call rate of around 1.0 indicates good quality. Where the call rate is less than 0.75, the scan experiment should be rejected. An accept data field indicates whether the analysis indicates rejection or acceptance.
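By way of illustration only, the wild type call rate and the resulting accept decision might be computed as follows. The 0.75 threshold comes from the description above; the function names, the representation of an atom as a reference base paired with a mapping of base to intensity, and the intensity values are hypothetical.

```python
def wild_type_call_rate(atoms):
    """Fraction of atoms whose brightest probe corresponds to the reference base.

    atoms is a list of (reference_base, {base: intensity}) pairs.
    """
    if not atoms:
        raise ValueError("no atom data")
    hits = sum(
        1 for reference_base, intensities in atoms
        if max(intensities, key=intensities.get) == reference_base
    )
    return hits / len(atoms)


def accept_data(call_rate, threshold=0.75):
    """Reject the scan experiment when the call rate is below the threshold."""
    return call_rate >= threshold


# Hypothetical scan: the last atom calls "A" although the reference is "T".
atoms = [
    ("A", {"A": 900, "C": 100, "G": 80, "T": 60}),
    ("C", {"A": 50, "C": 700, "G": 90, "T": 40}),
    ("G", {"A": 60, "C": 70, "G": 800, "T": 30}),
    ("T", {"A": 650, "C": 40, "G": 30, "T": 300}),
]
rate = wild_type_call_rate(atoms)
```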

[0091] Where scan experiment measurements indicate two or more non-wild type bases within a probe length, this indicates a measurement problem for the affected region of sequence. These regions are identified by difficult region analyses listed in a difficult region analysis table 556 by reference to analysis table 494. A difficult region result table 558 lists the regions identified as being difficult.
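By way of illustration only, the identification of difficult regions might be sketched as follows. The rule that two calls fall within a probe length when their positions differ by less than the probe length, the merging of overlapping regions, and the function name are assumptions made for the example.

```python
def difficult_regions(variant_positions, probe_length):
    """Return (start, end) spans where non-wild type calls crowd together.

    Two neighboring non-wild type positions closer than probe_length open or
    extend a region; isolated positions produce no region.
    """
    positions = sorted(set(variant_positions))
    regions = []
    for previous, current in zip(positions, positions[1:]):
        if current - previous < probe_length:
            if regions and regions[-1][1] == previous:
                # Extend the region that already ends at the previous call.
                regions[-1] = (regions[-1][0], current)
            else:
                regions.append((previous, current))
    return regions


# Hypothetical non-wild type calls along one fragment.
regions = difficult_regions([10, 15, 22, 100, 300, 310], probe_length=20)
```

Each returned span corresponds loosely to the start and end fields of a difficult region result record.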

[0092] Analysis dependency table 496 indicates interrelationships among the various analyses of FIGS. 4E-4G. A footprint analysis may depend on a normalization analysis which may in turn depend on a normalization region analysis. A base call analysis or P-Hat grouping analysis may depend on an atom analysis. A polymorphism analysis may depend on any of these analyses and/or a user polymorphism analysis and/or a clustering analysis.
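By way of illustration only, an execution order consistent with such dependencies might be derived with the graphlib module of the Python standard library (Python 3.9 or later). The analysis labels are hypothetical shorthand, and the particular set of dependencies chosen for the polymorphism analysis is just one of the combinations the description permits.

```python
from graphlib import TopologicalSorter

# Each analysis maps to the set of analyses whose results it consumes.
dependencies = {
    "footprint": {"normalization"},
    "normalization": {"normalization_region"},
    "base_call": {"atom"},
    "p_hat_grouping": {"atom"},
    "polymorphism": {"base_call", "p_hat_grouping", "footprint"},
}

# static_order yields every analysis after all of the analyses it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
```

The same ordering could be used to decide which analyses to rerun when an upstream analysis is flagged as needing an update.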

[0093] Another aspect of the investigation of polymorphisms is seeking patent protection for identified polymorphisms. FIG. 4H shows tables of polymorphism database 102 related to efforts to seek patent protection according to one embodiment of the present invention. A polymorphism patent sequence table 560 lists sequences for which patent protection is sought. A patent application table 562 lists patent applications directed toward the protection of polymorphisms. A polymorphism patent application sequence map table 564 implements a many-to-many relationship between patent applications and sequences. A prior application table 566 lists relationships between patent applications and prior related patent applications. An attorney table 568 lists attorneys responsible for preparing patent applications listed in patent application table 562. A law firm table 570 lists the law firms to which the attorneys listed in attorney table 568 belong.

[0094] An employee group table 572 lists groups of inventors for the patent applications listed in table 562. Individual inventors are listed in employee table 432. An employee group map table 574 implements a many-to-many relationship between inventors and groups of inventors.

[0095] The data model of FIG. 4H greatly facilitates the process of securing patent protection for polymorphisms and thereby increases the commercial incentive for investigation of polymorphisms.

[0096] Database Contents

[0097] The contents of the tables introduced above will now be presentedin greater detail in the following chart. TABLE FIELD COMMENT tblSubjectSubjectId: INTEGER Identifies biological source of sample. SpeciesID:INTEGER Species of subject. Name: VARCHAR2(20) Name of subject(anonimized for human subjects). Gender: VARCHAR2(10) Gender of subject.Family: VARCHAR2(20) Family of subject (anonimized for human subjects).Member: SMALLINT Position in family (father, mother, etc.). Group:VARCHAR2(20) Ethnic group. CellLineID: VARCHAR2(20) Identifier forsample source not associated with particular organism. IsReference:SMALLINT Whether or not subject is in a group. tblSpecies SpeciesId:INTEGER Species identifier. Name: VARCHAR2(30) Name of species.SubjectRelationship Subject1: INTEGER First subject in relationshipSubject2: INTEGER Second subject in relationship. Position: VARCHAR2(2)Nature of relationship. tblSubjectGroup GroupId: INTEGER Identifier ofgroup of subjects (not same as ethnic group). GroupCode: VARCHAR2(20)Code identifier for group. Comments: LONG VARCHAR User comments ongroup. upsize_ts: DATE Creation date for group. tblSubjectParticipationSubjectId: INTEGER Reference to subject table. GroupId: INTEGERReference to subject group table. tblSample SampleId: INTEGER Sampleidentifier. SubjectID: INTEGER Reference to subject table.SampleSourceId: CHAR(18) Institutional source of sample. Code:VARCHAR2(20) Code representing individual subject. Recipient:VARCHAR2(20) Person accepting sample. Provider: VARCHAR2(20) Person orinstitution providing sample. DateReceived: DATE Date sample received.ProtocolId: INTEGER Reference to protocol table. SampleTypeId: INTEGERReference to sample type table. tblSampleType SampleTypeId: INTEGERSample type identifier. Description: VARCHAR2(50) Description of sampletype. tblSample Source SampleSourceId: CHAR(18) Identifier ofinstitutional sample source. ProviderName: VARCHAR2(20) Name ofindividual or institutional sample provider. 
Item ItemId: INTEGER Itemidentifier. ItemTypeId: INTEGER Item type identifier. ItemName:VARCHAR2(50) Name of item. ItemDerivation Item1Id: INTEGER Derivationsource. Item2Id: INTEGER Derivation result. EmployeeId: INTEGER Employeeresponsible for derivation. DerivationTypeId: INTEGER Derivation typeidentifier. Protocolid: VARCHAR2(18) Reference to protocol table. Date:DATE Date of derivation. tblChip ChipId: INTEGER Rename reference toitem table. ChipDesignPlacementId: INTEGER Placement on wafer.LocationId: INTEGER Location of chip. WaferId: INTEGER Wafer the chipwas on. tblHybedChip HybedChipId: INTEGER Rename reference to itemtable. SubjectID: INTEGER Reference to subject table. ProtocolId:INTEGER Reference to protocol table. Repetition: SMALLINT Refers tonumber of times chip has been washed and reused. tblHybSampleMap ItemId:INTEGER Reference to item table. Protocol ProtocolId: INTEGER Protocolidentifier. ProtocolTemplateId: INTEGER Protocol template identifier.Name: VARCHAR2(100) Name of protocol. tblScanExperiment ScanExptId:INTEGER Scan experiment identifier. ItemId: INTEGER Reference to itemtable. ScanCode: VARCHAR2(25) File for scan results. ProtocolId:INTEGERP Reference to protocol table. ScanRatingId: INTEGER Assessmentof scan quality. ExperimenterId: INTEGER Experimenter identifier. Date:DATE Date of experiment. ConversionTool: VARCHAR2(50) Program used toconvert from scan image to intensities. ConversionDate: DATE Date ofconversion. ScanStatus: VARCHAR2(50) whether or not scan image has beenconverted to intensities Comments: LONG VARCHAR Comments. EmployeeEmployeeId: INTEGER Employee identifier. EmployeeCode: VARCHAR2(5) Codefor employee FName: VARCHAR2(20) First name of employee. MName:VARCHAR2(20) Middle name of employee. LName: VARCHAR2(20) Last name ofemployee. ItemType ItemId: INTEGER Item type identifier. ItemTypeName:VARCHAR2(30) Name of item type. FormName: VARCHAR2(100) Reference touser interface form for item type. 
ItemDerivationType DerivationTypeId:INTEGER Derivation type identifier. DerivationType: VARCHAR2(50)Description of derivation type. ItemTypeDerivation NextItemTypeId:INTEGER Result type of derivation. ItemTypeId: INTEGER Source type ofderivation. ItemAttribute itemAttributeId: INTEGER Item attributeidentifier. ItemAttributeTypeId: INTEGER Reference to item attributetype table. Attribute: VARCHAR2(50) Attribute value.ItemAttributeItemMap ItemAttributeId: INGEGER Reference to itemattribute table. ItemId: INTEGER Reference to item table.ItemAttributeType ItemAttributetypeId: INTEGER Item attributeidentifier. ItemAttributeName: VARCHAR2(30) Name of item attribute type.ItemTypeMap ItemAttributeTypeId: INTEGER Reference to item attributetype table. ItemTypeId: INTEGER Reference to item type table.ProtocolTemplate ProtocolTemplateId: INTEGER Protocol templateidentifier. Name: VARCHAR2(100) Name of protocol template. DateCreated:DATE Date protocol template created. FormName: VARCHAR2(50) Name of theelectronic form used for protocol template. Parameter ParameterId:INTEGER Parameter identifier. ParameterTemplateId: INTEGER Reference toparameter template table. Value: VARCHAR2(20) Value of parameter.ProtocolID: INTEGER Reference to protocol table. ParameterTemplateParameterTemplateId: INTEGER Parameter template identifier. Name:VARCHAR2(100) Name of parameter template. ParamTemplateSetId: INTEGERReference to parameter template set table. StandardValue: VARCHAR2(100)Default value for parameter. ParamTemplateSet ParamTemplateSetId:INTEGER Parameter template set identifier. TypeId: INTEGER Renamedreference to parameter template set type table. Name: VARCHAR2(20) Nameof parameter template set. ParamTemplateSetType ParamTempSetTypeId:INTEGER Parameter template set type identifier. Description:VARCHAR2(50) Description of parameter template set type.ParameterTemplateSetMap ProtocolTemplateId: INTEGER Reference toprotocol template table. 
ParamTemplateSetId: INTEGER Reference toparameter template set table. ProtocolTemplateDoc ProtocolDocId: INTEGERProtocol Template document identifier. ProtocolTemplateId: INTEGERReference to protcol template table. Name: VARCHAR2(100) Name ofprotocol template. PathAndFileName: VARCHAR2(50) File name for protocoltemplate document. AuthorName: INTEGER Author of protocol templatedocument. CreationDate: DATE Creation date of protocol templatedocument. tbFragment FragmentId: INTEGER Fragment identifier.ChipSequence: LONG VARCHAR Sequence of fragment. Code: VARCHAR2(50) Coderepresenting fragment. tblPrimerPair PrimerPairId: INTEGER Identifierfor primer pair. LeftPrimerId: INTEGER Left primer identifier.RightPrimerId: INTEGER Right primer identifier. PCRSize: INTEGER lengthof amplified fragment Worked: SMALLINT Whether or not pair successfullyamplified fragment. FragmentId: INTEGER Reference to fragment table.tblPCR Item1Id: INTEGER First part of reference to item derivationtable. Item2Id: INTEGER Second part of reference to item derivationtable. Reactionworked: SMALLINT Whether or not PCR reaction worked.PrimePairPCRMap PrimerPairId: INTEGER Reference to primer pair table.Item1Id: INTEGER First part of referenced item derivation table.Item2Id: INTEGER Second part of referenced item derivation table.tblPrimer PrimerId: INTEGER Primer identifier. ProtocolId: INTEGERReference to protocol table. OligoSeq: VARCHAR2(35) Sequence of primer.Position: INTEGER Position of primer on fragment. Length: INTEGER Lengthof primer. MeltingTemp: INTEGER Melting temperature of primer.Direction: VARCHAR2(20) Direction (forward or reverse). tblPrimerOrderOrderId: INTEGER Order identifier. EmployeeId: INTEGER Employee who madeorder. VendorId: INTEGER Vendor for order. OrderDate: DATE Date oforder. Owner: VARCHAR2(50) Name of employee making order. Vendor:VARCHAR2(50) Name of vendor. tbl Vendor VendorId: INTEGER Vendoridentifier. Vendor: VARCHAR2(50) Name of vendor. 
PhoneNumber:VARCHAR2(15) Phone number of vendor. FaxNumber: VARCHAR2(15) Fax Numberof vendor. Address: VARCHAR2(50) Address of vendor. City: VARCHAR2(50)City of vendor. State: VARCHAR2(50) State of vendor. Zip: VARCHAR2(50)Zip code of vendor. tblPrimerOrderDesignMap PrimerId: INTEGER Referenceto primer table. OrderId: INTEGER Reference to order table. tblWaferWaferId: INTEGER Wafer identifier. LotId: INTEGER Lot to which waferbelongs. Code: VARCHAR2(8) Code for wafer. SynthesisDate_delete: DATESynthesis date for wafer. Released: DATE Date wafer available. Done:SMALLINT Whether wafer production is complete. ExpirationDate: DATEExpiration date of wafer. ExpectedLife: CHAR(18) Expected useful life ofwafer. tblLot LotId: INTEGER Lot identifier. WaferDesignId: INTEGERIdentifier for wafer design. LotNumber: VARCHAR2(12) Lot number.WaferPN: VARCHAR2(50) Part number for wafer. tblTiling DesignTilingDesignID: INTEGER Tiling design identifier.ChipDesignSequenceMapID: NUMBER Reference to chip design sequence map.TilingFormatID: INTEGER Reference to tiling format table. UnitNumber:INTEGER 1 for sense, 0 for antisense AtomOffset: INTEGER # to add totranslate atom position in tiling to atom position in chip designtblTiling Format TilingFormatID: INTEGER Tiling format identifierOrientation: CHAR(18) Orientation for tiling. ProbeLength: SMALLINTLength of probes. SubstitutionPosition: SMALLINT Substitution positionfor mutation base in probes. tblAtomDesign AtomDesignId: NUMBER Atomdesign identifier. TilingDesignID: INTEGER Reference to tiling designtable. Position: INTEGER Position of atom in sequence. tblProbeDesignProbeDesignID: NUMBER Probe design identifier. ChipDesignId: INTEGERReference to chip design x: SMALLINT x position of probe. y: SMALLINT yposition of probe. tblProbeDesignRole ProbeDesignID: NUMBER Reference toprobe design table. 
AtomDesignID: NUMBER Reference to atom design table.Substitution: CHAR(18) Substitution position in probe design.Mismatches: NUMBER Whether probe is match or mismatch. tblProbeDataProbeDesignID: NUMBER Reference to probe design table. ScanExptID:INTEGER Reference to scan experiment table. Intensity: FLOAT Measuredhybridization intensity for probe. NPixels; NUMBER Number of pixels usedfor intensity calculation. StDev: NUMBER Standard deviation for pixels.tblAnalysis AnalysisId: INTEGER Analysis identifier. Analysis VersionID:INTEGER Reference to version of analysis. ProtocolID: INTEGER Referenceto protocol table. DatePerformed: DATE Date analysis performed.NeedsUpdate: NUMBER Whether analysis is current. tblAnalysisDependencyParentAnalysisId: INTEGER Analysis providing input. SubAnalysisId:INTEGER Analysis receiving input. Role: VARCHAR2(20) Role of dataprovided by parent analysis. TblAnalysisInput AnalysisinputID: INTEGERAnalysis input identifier. AnalysisId: INTEGER Analysis receiving input.Inputtype: VARCHAR2(20) Type of input. ObjectID: INTEGER Reference toinput data. tblChipDesignSeguenceMap ChipDesignSequenceMapID: NUMBERChip design sequence map identifier. FragmentID: INTEGER Reference tofragment table. ChipDesignId: INTEGER Chip design identifier.AtomOffset: NUMBER # to add to translate atom position in tiling to atomposition in chip design tblSequencePosition SequencePositionID: NUMBERSequence position identifier. ChipDesignSequenceMapID: NUMBER Referenceto chip design sequence map table. Position: NUMBER Position infragment. GenomicSequencePositionID: INTEGER Reference to genomicsequence position table. RefBase: INTEGER Reference base.tblGenomicSequencePosition GenomicSequencePositionID: INTEGER Genomicsequence position identifier. tblScanExperimentSet ScanExperimentSetID:NUMBER Scan experiment set identifier. tbsScanExperimentUsed ScanExptID:INTEGER Reference to scan experiment table. ScanExperimentSetID: NUMBERReference to scan experiment set table. 
tblTilingData TilingDataID:NUMBER Tiling data identifier. ScanExptID: INTEGER Reference to scanexperiment table. TilingDesignID: INTEGER Reference to tiling designtable. tblAtomData AtomDataID: INTEGER Atom data identifier.TilingDataID: NUMBER Reference to tiling data table.SubjectSequencePositionID: INTEGER Reference to subject sequenceposition table. tblSubjectSequencePosition SubjectSequencePositionID:INTEGER Subject sequence position identifier. SubjectID: INTEGERReference to subject table. SequencePositionID: NUMBER Reference tosequence position table. tblPolymorphismAnalysis AnalysisId: INTEGERReference to analysis table. tblPolyPositionResult AnalysisId: INTEGERReference to analysis table. PolyPositionID: INTEGER Polymorphismposition identifier. ScanExperimentSetID: NUMBER Reference to scanexperiment set table. PolyPositiontypeID: INTEGER Refers to possibilityof polymorphism at position, e.g., certain, likely, possible, mismatch(reference is wrong). WTBase: CHAR(18) Wild type base at position.MuBase: INTEGER Mutation base at position. tblUserPolyanalysisAnalysisId: INTEGER Reference to analysis table.tblUserPolyanalysisResult AnalysisId: INTEGER Reference to analysistable. SequencePositionID: NUMBER Reference to sequence position table.ScanExperimentSetID: NUMBER Reference to scan experiment set table.PolyPositionTypeID: INTEGER See polymorphism position result table.UserComment: VARCHAR2(256) User comment done polymorphism analysis.tblAtomanalysis AnalysisId: INTEGER Reference to analysis table.tblAtomResult AnalysisId: INTEGER Reference to analysis table.AtomDataID: INTEGER Reference to atom data table. PHat: FLOAT Relativeconcentration of mutant and wild type. PHatUpperbound: FLOAT Upperboundfor relative concentration. PHatLowerbound: FLOAT Lowerbound forrelative concentration. MaxIntensity: FLOAT Maximum measured intensityfor atom. WTIntensity: FLOAT Measured wild type intensity. MutIntensity:FLOAT Measured mutation intensity. 
LocalWTCallRate: FLOAT rate at whichatoms associated with surrounding sequence call reference baseIntensityRatio: FLOAT Ratio of intensity of wild type probe overintensity of mutation probe. tblBaseCallAnalysis AnalysisId: INTEGERReference to analysis table. tblBaseCallResult AnalysisId: INTEGERReference to analysis table. SubjectSequencePositionID: INTEGERReference to sequence position table. ScanExperimentSetID: NUMBERReference to skin experiments set table. CalledBase: VARCHAR2(1) Basecalled for subject based on experiment set. SuggestCheck: NUMBER Used toindicate whether this sample should be used for resequencingtblClusteringAnalysis AnalysisId: INTEGER Reference to analysis table.tblClusteringResult AnalysisId: INTEGER Reference to analysis table.SequencePositionID: NUMBER Reference to sequence position table.ScanExperimentSetID: NUMBER Reference to scan experiment set table.ClusteringFactor: FLOAT Result of clustering analysis.tblNormalizationRegionAnalysis AnalysisId: INTEGER Reference to analysistable. tblNormalizationRegion NormalizationRegionID: INTEGERNormalization region identifier. AnalysisId: INTEGER Reference toanalysis table. ChipDesignSequenceMapID: NUMBER Reference to chip designsequence map table. NumberScanExpt.Set Reference to scan experiment settable. RegionEnd: INTEGER Indication of end of the normalization region.RegionStart: INTEGER Indication of beginning of the normalizationregion. tblNormalizationAnalysis AnalysisId: INTEGER Reference toanalysis table. tblNormalizationResult NormalizationResultID: INTEGERNormalization result identifier. AnalysisId: INTEGER Reference toanalysis table. TilingDataID: INTEGER Reference to tiling data table.NormalizationRegionResultID: INTEGER Reference to normalization result.NormalizationValue: NUMBER Value used for normalization. DataOK: NUMBERIndication whether normalization result is usable. tblFootprintAnalysisAnalysisId: INTEGER Reference to analysis table. 
tblFootprint
FootprintID: NUMBER Footprint identifier.
AnalysisId: INTEGER Analysis identifier.
ChipDesignSequenceMapID: NUMBER Reference to chip design sequence map table.
ScanExperimentSetID: NUMBER Reference to scan experiment set table.
FFStart: NUMBER Start of footprint and sequence.
FPEnd: NUMBER End of footprint and sequence.

tblTilingDataQualityAnalysis
AnalysisId: INTEGER Reference to analysis table.

tbltilingDataQualityResult
TilingDataID: NUMBER Reference to tiling data table.
AnalysisId: INTEGER Reference to analysis table.
AvgWTIntensity: NUMBER Average wild type intensity.
WTCallRate: NUMBER Fraction of atoms where brightest of probes is one with reference space.
AcceptData: INTEGER Whether data is of acceptable quality.

tblDifficultRegionanalysis
AnalysisId: INTEGER Reference to analysis table.

tblDifficultRegionResult
ScanExptId: INTEGER Reference to scan experiment table.
AnalysisId: INTEGER Reference to analysis table.
ChipDesignSequenceMapID: NUMBER Reference to chip design sequence map table.
RgnStart: NUMBER Beginning of difficult region in sequence.
RgnEnd: NUMBER End of difficult region in sequence.
Reason: INTEGER Code indicating reason for difficult region, e.g., two or more non-wild type bases and less than a probe length.

tblPolyPatentSeq
PolyPatentSeqId: NUMBER Polymorphism sequence identifier.
Polyscreen: VARCHAR2(50) Reference to internal grouping of polymorphisms.
FragmentCode: VARCHAR2(50) Fragment sequence found in.
Position: LONG Position of polymorphism.
RefAllel: CHAR(2) Wild type base at position.
FreqP: FLOAT Frequency of wild type.
AltAllele: CHAR(2) Mutation base at position.
FreqQ: FLOAT Frequency of mutation base.
Heterozygocity: FLOAT Heterozygosity value.
SequenceTag: VARCHAR2(50) Sequence containing polymorphism including ambiguity code at polymorphism position.
GeneName: VARCHAR2(50) Name of gene.
ChromosomeNum: VARCHAR2(20) Chromosome number.
ChromosomeLoc: VARCHAR2(20) Location of gene on chromosome.
ForwardPrimer: VARCHAR2(50) Identifier for forward primer used to amplify fragment.
ReversePrimer: VARCHAR2(50) Identifier of primer used to amplify fragment.

tblPatentApp
PatentAppId: NUMBER Patent application identifier.
GroupId: NUMBER Reference to employee group table.
AttorneyId: NUMBER Reference to attorney table.
DocketNum: VARCHAR2(30) Docket number for patent application.
FilingDate: DATE Filing date of patent application.
Classification: VARCHAR2(30) Patent office classification for patent application.
SerialNumber: VARCHAR2(50) Serial number assigned by patent office.
CountryCode: VARCHAR2(50) Country in which patent application was filed.
InventionTitle: VARCHAR2(100) Title for patent application.

tblPolyPatentSeqMap
PatentAppId: NUMBER Reference to patent application table.
PolyPatentSeqId: NUMBER Reference to polymorphism patent sequence table.

tblPriorApp
PriorAppId: NUMBER Reference to related prior patent application in patent application table.
AppId: NUMBER Reference to application to which prior application is related.

tblAttorney
AttorneyId: NUMBER Attorney identifier.
LawFirmId: NUMBER Law firm where attorney works.
FirstName: VARCHAR2(20) First name of attorney.
MiddleName: VARCHAR2(5) Middle name of attorney.
LastName: VARCHAR2(30) Last name of attorney.
RegistrationNum: VARCHAR2(25) Patent office registration number of attorney.

tblLawFirm
LawFirmId: NUMBER Law firm identifier.
Company: VARCHAR2(100) Name of law firm.
Address: VARCHAR2(100) Address of law firm.
City: VARCHAR2(30) City address of law firm.
State: VARCHAR2(20) State address of law firm.
ZipCode: VARCHAR2(15) Zip Code of law firm.
Country: VARCHAR2(15) Country of law firm.
Telephone: VARCHAR2(30) Telephone number of law firm.
Fax: VARCHAR2(30) Facsimile number of law firm.
TELEX: VARCHAR2(20) Telex number of law firm.

tblEmployeeGroup
GroupId: NUMBER Identifier for inventor group.
GroupName: VARCHAR2(50) Name of inventor group.
Comments: VARCHAR2(50) Comments.
GroupList: VARCHAR2(255) Written out list of inventor names.

tblEmployeeGrpMap
EmployeeId: INTEGER Reference to employee table for inventor/employees.
GroupId: NUMBER Reference to inventor group table.
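
By way of illustration only, the relationship among tblPolyPatentSeq, tblPatentApp, and tblPolyPatentSeqMap listed above may be sketched as follows, using Python's standard sqlite3 module in place of a production database. Only a subset of the listed columns is shown. The SQLite type names, the primary and foreign key constraints, the helper names build_demo_db and polymorphisms_for_app, and all row values are assumptions made for this sketch and are not taken from the description.

```python
import sqlite3

# Column names follow the data dictionary; the constraints and SQLite type
# names are assumptions (the listed types are VARCHAR2, NUMBER, and so on).
SCHEMA = """
CREATE TABLE tblPolyPatentSeq (
    PolyPatentSeqId INTEGER PRIMARY KEY,
    FragmentCode    TEXT,
    Position        INTEGER,
    RefAllel        TEXT,
    AltAllele       TEXT
);
CREATE TABLE tblPatentApp (
    PatentAppId    INTEGER PRIMARY KEY,
    DocketNum      TEXT,
    InventionTitle TEXT
);
-- Map table: one row per (application, polymorphism sequence) pair, so one
-- application may cover many sequences and one sequence many applications.
CREATE TABLE tblPolyPatentSeqMap (
    PatentAppId     INTEGER NOT NULL REFERENCES tblPatentApp(PatentAppId),
    PolyPatentSeqId INTEGER NOT NULL REFERENCES tblPolyPatentSeq(PolyPatentSeqId),
    PRIMARY KEY (PatentAppId, PolyPatentSeqId)
);
"""


def build_demo_db():
    """Create an in-memory database holding hypothetical placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO tblPolyPatentSeq "
        "(PolyPatentSeqId, FragmentCode, Position, RefAllel, AltAllele) "
        "VALUES (?, ?, ?, ?, ?)",
        [(1, "FRAG-A", 120, "A", "G"), (2, "FRAG-B", 48, "C", "T")],
    )
    conn.executemany(
        "INSERT INTO tblPatentApp (PatentAppId, DocketNum, InventionTitle) "
        "VALUES (?, ?, ?)",
        [(10, "DOCKET-X", "Placeholder title X"),
         (11, "DOCKET-Y", "Placeholder title Y")],
    )
    conn.executemany(
        "INSERT INTO tblPolyPatentSeqMap (PatentAppId, PolyPatentSeqId) "
        "VALUES (?, ?)",
        [(10, 1), (10, 2), (11, 2)],
    )
    conn.commit()
    return conn


def polymorphisms_for_app(conn, patent_app_id):
    """Return (id, fragment code, position) for each sequence mapped to an application."""
    return conn.execute(
        "SELECT s.PolyPatentSeqId, s.FragmentCode, s.Position "
        "FROM tblPolyPatentSeqMap AS m "
        "JOIN tblPolyPatentSeq AS s ON s.PolyPatentSeqId = m.PolyPatentSeqId "
        "WHERE m.PatentAppId = ? "
        "ORDER BY s.PolyPatentSeqId",
        (patent_app_id,),
    ).fetchall()
```

In this sketch, calling polymorphisms_for_app(build_demo_db(), 10) lists the placeholder sequences mapped to application 10. Keeping the pairing in a separate map table, rather than as a column of either table, is what lets the same polymorphism sequence appear under more than one patent application.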

[0098] It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims. For example, tables may be deleted, contents of multiple tables may be consolidated, or contents of one or more tables may be distributed among more tables than described herein to improve query speeds and/or to aid system maintenance. Also, the database architecture and data models described herein are not limited to biological applications but may be used in any application. All publications, patents, and patent applications cited herein are hereby incorporated by reference.

What is claimed is:
 1. A computer-readable storage medium having stored thereon: an item table listing a plurality of item records identifying items; an item attribute table listing a plurality of item attribute records identifying attributes of said items; and wherein there is a many-to-many relationship between item records and item attribute records.
 2. The computer-readable storage medium of claim 1 wherein an item attribute item map table implements said many-to-many relationship between item records and item attribute records, said item attribute item map table listing a plurality of map records identifying both a particular item attribute and a particular item.
 3. The computer-readable storage medium of claim 1 having further stored thereon: an item derivation table listing a plurality of item derivation records identifying transformations between ones of said items used in biological analysis.
 4. The computer-readable storage medium of claim 3 having further stored thereon: a protocol table listing a plurality of protocol records specifying parameters of said transformation.
 5. The computer-readable storage medium wherein said items are used in a biological analysis.
 6. The computer-readable storage medium of claim 1 wherein said biological analysis comprises a polymorphism analysis.
 7. A computer-readable storage medium having stored thereon: an atom result table listing a plurality of atom result records, specifying relative wild-type and mutant sequence concentrations in targets; and a subject sequence position table listing a plurality of subject sequence position records, specifying combinations of subjects from whom said targets are derived and sequence positions, each said atom result record being associated with one or more atom result records.
 8. The computer-readable storage medium of claim 7 wherein said atom result records further specify upper and lower bounds for said concentrations.
 9. The computer-readable storage medium of claim 7 having further stored thereon: a subject table listing subject records specifying said subjects.
 10. A computer-readable storage medium having stored thereon: a polymorphism table listing polymorphism sequence records specifying sequences known to contain polymorphisms; and a patent application table listing patent application records specifying one or more polymorphisms specified by said polymorphism sequence records.
 11. The computer-readable storage medium of claim 10 wherein said polymorphism sequence records specify for each one of said polymorphisms a polymorphism position, a reference allele, and a base allele.
 12. The computer-readable storage medium of claim 11 wherein said polymorphism sequence records further specify for each one of said polymorphisms a measured heterozygosity.
 13. A computer-implemented method comprising: creating an item table listing a plurality of item records identifying items used in biological analysis; and creating an item attribute table listing a plurality of item attribute records identifying attributes of said items; and wherein there is a many-to-many relationship between item records and item attribute records.
 14. The computer-implemented method of claim 13 further comprising the step of: creating an item attribute item map table that implements said many-to-many relationship between item records and item attribute records, said item attribute item map table listing a plurality of map records identifying both a particular item attribute and a particular item.
 15. The computer-implemented method of claim 13 comprising: an item derivation table listing a plurality of item derivation records identifying transformations between ones of said items used in biological analysis.
 16. The computer-implemented method of claim 15 further comprising: creating a protocol table listing a plurality of protocol records specifying parameters of said transformation.
 17. The computer-implemented method of claim 13 wherein said biological analysis comprises a polymorphism analysis.
 18. A computer-implemented method comprising: creating an atom result table listing a plurality of atom result records, specifying relative wild-type and mutant sequence concentrations in targets; and creating a subject sequence position table listing a plurality of subject sequence position records, specifying combinations of subjects from whom said targets are derived and sequence positions, each said atom result record being associated with one or more atom result records.
 19. The computer-implemented method of claim 18 wherein said atom result records further specify upper and lower bounds for said concentrations.
 20. The computer-implemented method of claim 18 further comprising: creating a subject table listing subject records specifying said subjects.
 21. A computer-implemented method comprising: creating a polymorphism table listing polymorphism sequence records specifying sequences known to contain polymorphisms; and creating a patent application table listing patent application records specifying one or more polymorphisms specified by said polymorphism sequence records.
 22. The computer-implemented method of claim 21 wherein said polymorphism sequence records specify for each one of said polymorphisms a polymorphism position, a reference allele, and a base allele.
 23. The computer-implemented method of claim 22 wherein said polymorphism sequence records further specify for at least one of said polymorphisms a measured heterozygosity.